by Cory Lanker, LLNL statistician
June 27, 2017, 10:00-11:30 in B543 R1001
This course is a four-class introduction to predictive modeling. It is for all skill levels and uses the R statistics software. Because the course is presented as R notebooks, all the code is included here. In this class, I'll discuss the different prediction models and how to code them in R. Tomorrow (at 2:00) we'll cover the categorical-response parallel to today's material: classification. The last class (Thurs) is a prediction contest in which teams will use this material to compete for the best prediction values.
See the Class 1 notes for assistance installing Jupyter and Anaconda.
In this class:
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-733567
This course is structured around Max Kuhn and Kjell Johnson's master's-level
text on prediction modeling. The code is either from the book, especially when
using their wonderful caret R package, or it's code I wrote. Their code is
disseminated under their copyright. All data sets in this course are publicly available.
# functions for this page
plotsize <- function(wd,ht){options(repr.plot.width=wd, repr.plot.height=ht)}
packagelist <- c("MASS", "FNN", "ellipse", "AppliedPredictiveModeling", "Hmisc",
"missForest", "caret", "corrplot", "earth", "lattice", "e1071",
"elasticnet", "pls", "doMC", "kernlab", "nnet", "rpart", "ipred",
"Cubist", "gbm", "party", "partykit", "randomForest", "car",
"plyr", "proxy", "RWeka")
# It's likely that RWeka won't load on LLNL machines... that package isn't necessary.
output <- sapply(packagelist, require, character.only = TRUE)
if (any(!output)){
cat('Installing packages: ', packagelist[!output],' ...\n')
install.packages(packagelist[!output])
output <- sapply(packagelist[!output], require, character.only = TRUE)
}
# sessionInfo()
A common metric to assess performance of a regression model (predicts continuous $y$) is to calculate the mean squared error (MSE) between the predictions and the true values: $$MSE = \frac1n\sum_{i=1}^n (\hat{y}_i - y_i)^2.$$
MSE penalizes errors according to the square of the deviation. This means that being off by 0.1 on each of 100 observations gives the same MSE as getting 99 exactly correct but being off by 1 on the remaining observation: $$\frac1{100} \sum_{i=1}^{100} (0.1)^2 = \frac1{100}\, (100)(.01) = \frac1{100} \left( \sum_{i=1}^{99} (0)^2 + \sum_{i=100}^{100} (1)^2\right).$$ The average absolute error or AAE ($\frac1n\sum_i |\hat{y}_i-y_i|$) is 0.1 and 0.01, respectively, but the MSE is the same.
The minimizer of MSE is the conditional mean $E(Y|X)$, while the minimizer of AAE is the conditional median of the distribution.
MSE is often reported using its square root, RMSE.
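The contrast between MSE and AAE in the example above can be checked numerically (a quick sketch, not part of the class code):

```r
# Two error patterns from the example above:
# (a) off by 0.1 on every one of 100 observations
# (b) exact on 99 observations, off by 1 on the remaining one
err_a <- rep(0.1, 100)
err_b <- c(rep(0, 99), 1)
mean(err_a^2); mean(err_b^2)        # MSE: 0.01 for both
mean(abs(err_a)); mean(abs(err_b))  # AAE: 0.1 vs 0.01
sqrt(mean(err_a^2))                 # RMSE: 0.1
```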
This class will use a data set on compound solubility. A drug's solubility plays a role in whether a medication can be administered orally. The data set consists of 1267 compounds and 228 predictors describing their chemical structure.
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
### Web Page: http://www.appliedpredictivemodeling.com
### Contact: Max Kuhn (mxkuhn@gmail.com)
###
### Chapter 6: Linear Regression and Its Cousins
###
### Required packages: AppliedPredictiveModeling, lattice, corrplot, pls,
### elasticnet,
###
### Data used: The solubility from the AppliedPredictiveModeling package
plotsize(6,3)
data(solubility)
cat(dim(solTrainX),"::",length(solTrainY),"::",dim(solTestX),"::",length(solTestY),'\n')
col.factor = sapply(solTrainX[1,],is.factor)
table(col.factor)
plotsize(6,3)
col.unique = sapply(apply(solTrainX,2,unique),length)
cat('There are', min(which(col.unique > 2))-1,'binary predictors (fingerprints). The rest are called descriptors.')
The other 20 predictors are descriptors: 16 count variables (e.g., number of bond types or certain atoms) and 4 continuous predictors.
tabulate(colSums(is.na(rbind(solTrainX, solTestX))))
sum(is.na(solTrainY))
sum(col.unique[1:208] != 2)
plotsize(7,3)
plot(temp <- colMeans(solTrainX[,1:208]),ylab='proportion',ylim=c(0,1),pch=3)
min(temp)
abline(h=0,col='blue')
plotsize(7,9)
par(mfrow=c(4,1), las=1, mai=c(.25,1.25,.25,.125))
boxplot(solTrainX[,209], horizontal=T, main=colnames(solTrainX)[209])
boxplot(solTrainX[,210:224], horizontal=T)
boxplot(solTrainX[,225:226], horizontal=T)
boxplot(solTrainX[,227:228], horizontal=T)
head(solTrainX[,209:228])
Columns 1:208 are binary 0/1; column 209 (MolWt) and columns 226:228 (Hydro, SA1, SA2) are continuous; columns 210:225 are the 16 count descriptors.
Let $y$ be $\log(\text{solubility})$. The range, mean, and median are, with a boxplot:
y = solTrainY
cat("range=",range(y)," mean=",mean(y)," median=",median(y))
plotsize(7,2.5)
boxplot(solTrainY, horizontal=TRUE, main="Boxplot of log(Solubility) values")
# show plot of MSE versus constant value, and minimum at mean(Y)
xs = seq(-3.5,-1.75,,176)
mse = sapply(xs, function(x) (1/length(y))*sum((y - x)^2))
# show plot of AAE versus constant value, and minimum at median(Y)
aae = sapply(xs, function(x) (1/length(y))*sum(abs(y-x)))
plotsize(8,3.5)
par(mfrow=c(1,2))
plot(xs,mse,type='l',lwd=2,main='MSE vs. constant',xlab='constant')
abline(v=mean(y),col='red',lwd=2)
abline(v=median(y),col='blue',lwd=2)
legend('topleft',c('mean','median'),lwd=2,col=c(2,4),cex=.75,y.intersp=2.5,box.lwd=.25)
plot(xs,aae,type='l',lwd=2,main='AAE vs. constant',xlab='constant')
abline(v=mean(y),col='red',lwd=2)
abline(v=median(y),col='blue',lwd=2)
# show RMSE calculation using mean value
cat("The best training RMSE using a constant prediction value is",
rmse <- sqrt( (1/length(y))*sum((y-mean(y))^2)))
We may get a sense of the RMSE by marking the mean $\pm$ RMSE on a histogram of the log(Solubility) values:
plotsize(7,3)
hist(y, 30, main="Mean log(Solubility) +/- RMSE")
abline(v=mean(y),lwd=4,col='darkorange')
abline(v=mean(y)+c(-1,1)*rmse,lwd=3,lty=2,col='darkorange')
Our goal is to predict solubility (continuous) based on the information in the 208 binary/16 count/4 continuous predictors. Only a minuscule portion of the predictor space is populated with our $\mathbf{X}$ observations, and it's only at those locations that we know a $y$. How to find a function of $\mathbf{X}$ approximating $E(y|X)$ (thus minimizing the mean squared error) given such limited data?
There are $2^{208}$ or $4.1 \times 10^{62}$ possible arrangements of these 208 binary variables.
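A quick sanity check of that count (the double-precision value is exact enough for display):

```r
2^208            # about 4.1e62 arrangements of the fingerprints
208 * log10(2)   # about 62.6, i.e., a 63-digit number
```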
Modeling will only work if individual predictor values have strong influences on $y$ that can be considered separately. If the response instead depended on the full 208-way interaction, there would be $2^{208}$ distinct cases of how $y$ is affected, far more than any data set could cover.
For example, we could have a case of two binary predictors such that if $x_1$ or $x_2$ is one, $y$ is likely to be high, and if $x_1$ and $x_2$ are both zero, $y$ takes a low value. In this case, there isn't an interaction effect between $x_1$ and $x_2$, so we can consider the variables separately when forming a prediction for $y$. However, if in addition $y$ is low when $x_1$ and $x_2$ are both one, then we must consider all four possibilities of the $(x_1, x_2)$ space.
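The additive-versus-interaction distinction can be illustrated on simulated data (a toy example of mine, not the solubility data):

```r
set.seed(1)
n  <- 400
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.5)
# additive case: each predictor contributes separately
y_add <- x1 + x2 + rnorm(n, sd = 0.3)
# interaction case: y drops back down when both are one
y_int <- x1 + x2 - 2 * x1 * x2 + rnorm(n, sd = 0.3)
coef(lm(y_add ~ x1 * x2))["x1:x2"]  # interaction term near 0
coef(lm(y_int ~ x1 * x2))["x1:x2"]  # interaction term near -2
```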
solTemp <- solTrainX[,1:208]
solTemp$Y <- solTrainY
solCorr <- cor(solTemp[,c(209,1:208)])
a <- which.max(solCorr[1,2:209])
cat(a, solCorr[1,a+1])
plotsize(7,3.5)
plot(solCorr[1,2:209], ylab='Correlation with y')
abline(h=0,col=3)
plotsize(8,8)
corrplot::corrplot(solCorr, tl.cex=.3, order="hclust") # try running with and without hclust
#corrplot::corrplot(solCorr, tl.cex=.3)
solTemp <- solTrainX[,209:228]
solTemp$Y <- solTrainY
solCorr <- cor(solTemp[,c(21,1:20)])
a <- which.max(solCorr[1,2:21])
cat(a, solCorr[1,a+1])
plotsize(7,3.5)
plot(solCorr[1,2:21], ylab='Correlation with y')
abline(h=0,col=3)
There looks to be a lot of predictive information in these variables.
Our task is to predict the solTestX data. While it will not span exactly the same space as the training data, it should be close, at least in the ballpark. Let's check the univariate distributions as a basic sanity test.
plotsize(7,6)
plot(colMeans(solTrainX[,1:208]),ylab='proportion',ylim=c(0,1),pch=3)
points(temp <- colMeans(solTestX[,1:208]),ylab='proportion',col=2,ylim=c(0,1),pch=3)
min(temp)
abline(h=0,col='blue')
sum(colMeans(solTrainX[,1:208]) > colMeans(solTestX[,1:208]))/208
Hmm... 67% of the binary predictors have more 1's in the training data. A red flag for potential overfitting.
The statistical test:
cat("p-value of similarity is",2*(1-pbinom(0.5 + 0.668*208, 208, 0.5)))
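An exact version of the same test, assuming 139 of the 208 predictors (about 66.8%) had higher training means, can be obtained with `binom.test`, which avoids the manual continuity correction:

```r
# exact two-sided test of a fair 50/50 split for 139 of 208 predictors
binom.test(139, 208, p = 0.5)$p.value
```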
rv = apply(rbind(solTestX, solTrainX), 2, range)
plotsize(7,9)
par(mfrow=c(4,1), las=1, mai=c(.25,1.25,.25,.125))
boxplot(solTrainX[,209], horizontal=T, main="Training: MolWt", ylim=rv[,209])
boxplot(solTestX[,209], horizontal=T, main="Testing: MolWt", ylim=rv[,209])
is.range = 210:224; yrange = c(min(rv[1,is.range]), max(rv[2,is.range]))
boxplot(solTrainX[,is.range], horizontal=T, main="Training data", ylim=yrange)
boxplot(solTestX[,is.range], horizontal=T, main="Test data", ylim=yrange)
is.range = 225:226; yrange = c(min(rv[1,is.range]), max(rv[2,is.range]))
boxplot(solTrainX[,is.range], horizontal=T, main="Training data", ylim=yrange)
boxplot(solTestX[,is.range], horizontal=T, main="Test data", ylim=yrange)
is.range = 227:228; yrange = c(min(rv[1,is.range]), max(rv[2,is.range]))
boxplot(solTrainX[,is.range], horizontal=T, main="Training data", ylim=yrange)
boxplot(solTestX[,is.range], horizontal=T, main="Test data", ylim=yrange)
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 6.1 Case Study: Quantitative Structure- Activity
### Relationship Modeling
### Some initial plots of the data
plotsize(6,4)
xyplot(solTrainY ~ solTrainX$MolWeight, type = c("p", "g"),
ylab = "Solubility (log)",
main = "(a)",
xlab = "Molecular Weight")
xyplot(solTrainY ~ solTrainX$NumRotBonds, type = c("p", "g"),
ylab = "Solubility (log)",
xlab = "Number of Rotatable Bonds")
bwplot(solTrainY ~ ifelse(solTrainX[,100] == 1,
"structure present",
"structure absent"),
ylab = "Solubility (log)",
main = "(b)",
horizontal = FALSE)
# (C) Kuhn and Johnson (2013)
### Find the columns that are not fingerprints (i.e. the continuous
### predictors). grep will return a list of integers corresponding to
### column names that contain the pattern "FP".
isFingerprints <- grep("FP", names(solTrainXtrans))
notFingerprints <- setdiff(1:ncol(solTrainXtrans),isFingerprints)
plotsize(8,5)
featurePlot(solTrainXtrans[, notFingerprints[c(1:7,9:11,13,17)]],
solTrainY,
between = list(x = 1, y = 1),
type = c("g", "p", "smooth"),
labels = rep("", 2))
notFingerprints
# (C) Kuhn and Johnson (2013)
### We used the full namespace to call this function because the pls
### package (also used in this chapter) has a function with the same
### name.
plotsize(6,6)
corrplot::corrplot(cor(solTrainXtrans[, notFingerprints]),
order = "hclust",
tl.cex = .8)
Linear regression employs the least squares solution $\mathbf{b} = (X'X)^{-1} X'y$. Notice that the formula includes $(X'X)^{-1}$; $X'X$ is proportional to the covariance matrix of $\mathbf{X}$, and inverting it is sensitive to multicollinearity, which inflates the variability of the estimates in $\mathbf{b}$. We can use the Variance Inflation Factor (VIF) to monitor potential problems with the matrix used to calculate the linear regression estimate $\mathbf{b}$.
Note: with limited data relative to the number of predictors $p$ (which includes all data sets today), influential observations greatly affect $\mathbf{b}$. We can try a robust regression technique to see if we get different results (e.g., Huber loss, which penalizes residuals quadratically up to a threshold and only linearly beyond it).
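A minimal sketch of a Huber-type robust fit using `MASS::rlm` (simulated data, not the solubility set):

```r
library(MASS)  # provides rlm and psi.huber
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 10               # inject five gross outliers
coef(lm(y ~ x))                     # OLS intercept is pulled upward
coef(rlm(y ~ x, psi = psi.huber))   # Huber loss downweights the outliers
```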
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 6.2 Linear Regression
### Create a control function that will be used across models. We
### create the fold assignments explicitly instead of relying on the
### random number seed being set to identical values.
set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE, k=4) # for speed purposes, I use 4-fold CV
str(indx)
ctrl <- trainControl(method = "cv", index = indx)
ctrl$number = length(ctrl$index) # for some reason, this isn't set correctly in the function trainControl
str(ctrl)
set.seed(100)
indx20 <- createFolds(solTrainY, returnTrain = TRUE, k=20)
ctrl20 <- trainControl(method = "cv", index = indx20)
Ideally we would use 10-fold CV, but that still takes too long for some models, so we use 4-fold CV here. Some methods take longer to train; you could reduce the number of folds to the 3 to 5 range and perhaps not suffer much degradation in the CV tuning-parameter estimates.
# (C) Kuhn and Johnson, 2013
### Linear regression model with all of the predictors. This will
### produce some warnings that a 'rank-deficient fit may be
### misleading'. This is related to the predictors being so highly
### correlated that some of the math has broken down.
set.seed(100)
lmTune0 <- train(x = solTrainXtrans, y = solTrainY,
method = "lm",
trControl = ctrl)
lmTune0
# the danger of 4-fold CV:
lmTune0$resample
# what does (X'X)^-1 look like?
plotx = as.matrix(solTrainX)
ev = eigen(t(plotx) %*% plotx)
str(ev)
plotsize(7,4)
plot(log10(ev$values))
ev$values[220:228]
There are two nearly collinear pairs causing the tiny eigenvalues: (1) NumAtoms/NumBonds, (2) NumNonHAtoms/NumNonHBonds.
cor(solTrainX$NumAtoms, solTrainX$NumBonds)
cor(solTrainX$NumNonHAtoms, solTrainX$NumNonHBonds)
col.remove = (colnames(solTrainX) %in% c('NumBonds', 'NumNonHBonds'))
which(col.remove)
# (C) Kuhn and Johnson, 2013
### And another using a set of predictors reduced by unsupervised
### filtering. We apply a filter to reduce extreme between-predictor
### correlations. Note the lack of warnings.
tooHigh <- findCorrelation(cor(solTrainXtrans), .95)
tooHigh
Now, running the regression on the filtered predictors, we no longer get multicollinearity warnings:
trainXfiltered <- solTrainXtrans[, -tooHigh]
testXfiltered <- solTestXtrans[, -tooHigh]
set.seed(100)
lmTune <- train(x = trainXfiltered, y = solTrainY,
method = "lm",
trControl = ctrl)
lmTune
### Save the test set results in a data frame
testResults <- data.frame(obs = solTestY,
Linear_Regression = predict(lmTune, testXfiltered))
summary(lmTune)
# The resampling Rsquared was 0.87 while the full model
# training Rsquared is much higher at 0.94
Stepwise regression adds and removes variables by trading off likelihood against a penalty on the number of coefficients. R's `step` uses the AIC penalty by default (the code below passes `k = log(n)`, giving BIC instead). To speed up the computation, I keep only the predictors with the highest 1/3 of $|t|$ scores in the full regression (using the following code block):
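For reference, with $L$ the maximized likelihood, $k$ the number of estimated parameters, and $n$ the sample size, the two penalties are:

```latex
\mathrm{AIC} = 2k - 2\ln L, \qquad \mathrm{BIC} = k \ln n - 2\ln L.
```

In `step`, the `k` argument is the multiplier on the parameter count: `k = 2` (the default) gives AIC and `k = log(n)` gives BIC.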
trainingData <- solTrainXtrans
trainingData$Solubility <- solTrainY
lmFull <- lm(Solubility ~ ., data=trainingData)
out <- summary(lmFull)
keyvars <- order(abs(out$coefficients[,3]), decreasing=TRUE) # rank coefficients by |t| score
keyvars <- c(setdiff(keyvars, 1)) # remove intercept (if present)
trainingData <- trainingData[,keyvars[1:80]-1]
trainingData$Solubility <- solTrainY
lmFull <- lm(Solubility ~ ., data=trainingData)
lmReduced <- step(lmFull, trace=0, k = log(nrow(trainingData)))
In the added-variable plots below for the stepwise regression, FP154 doesn't look like a real effect (just fitting noise), while a variable like FP044 does look significant.
plotsize(8,6)
par(las=1)
avPlots(lmReduced)
Xmat = model.matrix(lmReduced)
dim(Xmat)
head(Xmat)
colnames(Xmat)
Xmat = as.data.frame(Xmat)
Xmat$Solubility = trainingData$Solubility
varFP154 = lm(FP154 ~ . - Solubility, data=Xmat)
varY = lm(Solubility ~ . - FP154, data=Xmat)
KWdata = data.frame(Q = varY$resid, C = (varFP154$resid > 0.4))
kruskal.test(Q ~ C, data=KWdata)
plotsize(4,4)
plot(varFP154$resid, varY$resid)
set.seed(0)
pvals = rep(NA,200)
varY = lm(Solubility ~ . - FP154, data=Xmat)
KWdata = data.frame(Q = varY$resid, C = (varFP154$resid > 0.4))
for (i in 1:200){
out = kruskal.test(Q ~ C, data=KWdata[sample(nrow(KWdata), nrow(KWdata), replace=TRUE),])
pvals[i] = out$p.value
}
plotsize(6,4)
hist(pvals,25)
varFP023 = lm(FP023 ~ . - Solubility, data=Xmat)
varY = lm(Solubility ~ . - FP023, data=Xmat)
KWdata = data.frame(Q = varY$resid, C = (varFP023$resid > 0.5))
set.seed(0)
pvals = rep(NA,200)
for (i in 1:200){
out = kruskal.test(Q ~ C, data=KWdata[sample(nrow(KWdata), nrow(KWdata), replace=TRUE),])
pvals[i] = out$p.value
}
hist(pvals, 25)
summary(lmReduced)
A nice result from stepwise regression is that the summary above agrees with our assessment from the added-variable plots. Note that fingerprints 19, 133, 148, and 154 don't look significant and have relatively large p-values (t-tests of coefficient significance). Compare with the earlier lmFull model summary, shown below with all coefficients with p-values > 0.01 removed:
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.88083 -0.30042 0.00874 0.31417 1.64368
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.531279 1.246753 7.645 6.47e-14 ***
FP040 0.505803 0.176207 2.871 0.004215 **
FP061 -0.491600 0.137319 -3.580 0.000366 ***
FP073 -0.590945 0.209965 -2.814 0.005015 **
FP076 0.726063 0.170738 4.252 2.38e-05 ***
FP079 0.625276 0.187041 3.343 0.000870 ***
FP081 -0.293340 0.108986 -2.692 0.007272 **
FP082 0.273170 0.097349 2.806 0.005146 **
FP085 -0.403105 0.143853 -2.802 0.005208 **
FP087 0.692471 0.263050 2.632 0.008653 **
FP096 -0.409433 0.146340 -2.798 0.005278 **
FP109 0.663429 0.221248 2.999 0.002803 **
FP111 -0.443722 0.144305 -3.075 0.002183 **
FP119 0.780114 0.270026 2.889 0.003977 **
FP127 -0.725249 0.171165 -4.237 2.55e-05 ***
FP130 -1.033193 0.379176 -2.725 0.006585 **
FP142 0.572553 0.149489 3.830 0.000139 ***
FP143 0.751869 0.285099 2.637 0.008533 **
FP154 -0.825790 0.200460 -4.119 4.22e-05 ***
FP164 0.607709 0.192168 3.162 0.001629 **
FP171 0.423581 0.120090 3.527 0.000446 ***
FP176 0.797610 0.253882 3.142 0.001747 **
FP202 0.480010 0.180908 2.653 0.008140 **
MolWeight -1.422025 0.231033 -6.155 1.23e-09 ***
NumRotBonds -0.460000 0.133451 -3.447 0.000599 ***
NumHydrogen 0.497291 0.157906 3.149 0.001702 **
NumOxygen 1.493861 0.401794 3.718 0.000216 ***
NumRings -1.887635 0.441071 -4.280 2.12e-05 ***
SurfaceArea1 0.278697 0.039748 7.012 5.30e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5806 on 743 degrees of freedom
Multiple R-squared: 0.9371, Adjusted R-squared: 0.9195
F-statistic: 53.43 on 207 and 743 DF, p-value: < 2.2e-16
# Need to find the top 1/3 of variables for stepwise regression
# Perform stepwise on smaller datasets and keep track of those in final model
coltimes = rep(0, ncol(solTrainXtrans))
for (k in 1:20){
#cat("starting CV division",k,"...\n")
set.seed(k)
colindex = sample(rep(1:10, length=ncol(solTrainXtrans)))
for (j in 1:max(colindex)){
trdat = solTrainXtrans[,colindex == j]
trdat$Y = solTrainY
out.full = lm(Y ~ ., data=trdat)
out.step = step(out.full, trace=0, k=log(nrow(trdat)))
inmodel = (colnames(solTrainXtrans) %in% colnames(out.step$model))
coltimes[inmodel] = coltimes[inmodel] + 1
}
}
# [less than 1 minute to run]
plotsize(6,5)
boxplot(cor(solTrainXtrans, solTrainY) ~ coltimes)
abline(h=0, col='red')
sum(coltimes >= 10)
trainingData <- solTrainXtrans[, coltimes >= 10]
testingData <- solTestXtrans[, coltimes >= 10]
trainingData$Solubility <- solTrainY
lmFull <- lm(Solubility ~ ., data=trainingData)
out <- summary(lmFull)
starttime <- proc.time()[3]
lmReducedVS <- step(lmFull, trace=0, k = log(nrow(trainingData)))
proc.time()[3] - starttime
summary(lmReducedVS)
plotsize(8,6)
par(las=1)
avPlots(lmReducedVS)
testResults$step = predict(lmReduced, solTestXtrans)
testResults$stepVS = predict(lmReducedVS, solTestXtrans)
cor(testResults)
a = testResults
for (j in 2:4)
cat(colnames(a)[j], 1 - sum((a[,1] - a[,j])^2)/sum((a[,1]-mean(a[,1]))^2),'\n')
Using a rough variable-screening technique yields better results in a reasonable amount of time. Variable screening becomes more important as large numbers of predictors bring some computationally intense methods to a crawl. Note that stepVS achieves an $R^2$ of 0.871 while Linear_Regression achieves only 0.866. Compare this to the estimated $R^2$ and adjusted $R^2$ values:
| method | est. $R^2$ | $R^2$-$adj$ | actual $R^2$ |
|---|---|---|---|
| Linear Regression | 0.937 | 0.920 | 0.866 |
| Step Regression | 0.850 | 0.846 | 0.821 |
| Step with VS | 0.912 | 0.908 | 0.871 |
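Recall how adjusted $R^2$ discounts for model size ($n$ observations, $p$ predictors):

```latex
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.
```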
There is likely something fundamentally different between the training set and the testing set.
Check for outliers, nonrandomness, other structure, or nonnormality:
plotsize(4,4)
xyplot(solTrainY ~ predict(lmTune), type=c("p","g"), #plot points (p) and grid (g)
xlab="Predicted", ylab="Observed")
xyplot(resid(lmTune) ~ predict(lmTune), type=c("p","g"), #plot points (p) and grid (g)
xlab="Predicted", ylab="Residuals")
plotsize(6,3)
hist(resid(lmTune))
# warning: don't rely on shapiro.test... everything is nonnormal with enough data
shapiro.test(resid(lmTune))
# the test suggests rejecting H0: {normal errors} in favor of H1: {lack of normality}
plotsize(7,3)
par(mfrow=c(1,3))
for (j in 1:3){
sampsize = 5*10^j
pvals = rep(NA, 1000)
for (k in 1:length(pvals)){
set.seed(k)
a <- rnorm(sampsize)
pvals[k] = shapiro.test(a)$p.value
}
hist(pvals, main=sprintf('Non-normality test p-values\nfor size = %i',sampsize))
}
plotsize(7,3)
par(mfrow=c(1,3))
for (j in 1:3){
sampsize = 5*10^j
pvals = rep(NA, 1000)
for (k in 1:length(pvals)){
set.seed(k)
a <- rt(sampsize,30)
pvals[k] = shapiro.test(a)$p.value
}
hist(pvals, main=sprintf('Non-normality test p-values\nfor size = %i',sampsize))
}
plotsize(8,5)
par(mfrow=c(1,2),las=1)
plot(a <- -400:400/100, pnorm(a), type='l',lwd=3,col='red',xlab='x',ylab='density',
main='cumulative distribution function')
lines(a, pt(a,30), lwd=1.5)
plot(a <- -400:400/100, pnorm(a), type='l',lwd=3,col='red',log='y',xlab='x',
main='cumulative distribution function')
lines(a, pt(a,30), lwd=1.5)
Seen another way, standard normals rounded to the nearest 0.1 lead to a nonnormal test result for large samples:
plotsize(7,3)
par(mfrow=c(1,3))
for (j in 1:3){
sampsize = 5*10^j
pvals = rep(NA, 1000)
for (k in 1:length(pvals)){
set.seed(k)
a <- round(rnorm(sampsize),1)
pvals[k] = shapiro.test(a)$p.value
}
hist(pvals, main=sprintf('Non-normality test p-values\nfor size = %i',sampsize), breaks=0:10/10)
}
To get a flavor of how complicated regression in caret can be:
rlmPCA <- train(solTrainXtrans, solTrainY, method="rlm", # robust linear model
preProcess = "pca", # using PCA preprocessing, as in Class 1
trControl = ctrl)
rlmPCA
testResults$RLM <- predict(rlmPCA, solTestXtrans)
Singular value decomposition is used to decompose $\mathbf{X}$ (an $n \times p$ matrix) into an orthonormal matrix ($p \times p$) that best explains the row space of $\mathbf{X}$ (the $\mathbb{R}^p$ space spanned by the rows of $\mathbf{X}$).
Then we see how few of these orthonormal vectors can still yield a good prediction model for $y$. Note that how the vectors correlate with $y$ is not considered in building the machinery of the model. If we decompose $\mathbf{X}'\mathbf{X}$ (a $p \times p$ matrix, related to the covariance of $\mathbf{X}$), we get eigenvectors (the same orthonormal vectors mentioned above) and eigenvalues. It's the eigenvectors of $\mathbf{X}'\mathbf{X}$ that are used as regressors in PCR.
Contrasting with the $y$-indifference of PCR is PLSR, where regressors are determined by their ability to best correlate with $y$. Then we see how few of these $y$-correlated orthonormal vectors can still yield a good prediction model for $y$.
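A minimal by-hand PCR sketch on simulated data (the `pls` package automates this; everything here is illustrative):

```r
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)
pc <- prcomp(X, center = TRUE, scale. = TRUE)
# PCR: regress y on the first k component scores; note that k is chosen
# without consulting y, which is exactly PCR's weakness relative to PLS
k <- 4
pcr_fit <- lm(y ~ pc$x[, 1:k])
summary(pcr_fit)$r.squared
```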
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 6.3 Partial Least Squares
## Run PLS and PCR on solubility data and compare results
set.seed(100)
plsTune <- train(x = solTrainXtrans, y = solTrainY,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl,
preProc = c("center", "scale"))
plsTune
testResults$PLS <- predict(plsTune, solTestXtrans)
set.seed(100)
pcrTune <- train(x = solTrainXtrans, y = solTrainY,
method = "pcr",
tuneGrid = expand.grid(ncomp = 1:40),
trControl = ctrl)
# technically, minimum at 150
pcrTune
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
plsPlotData <- rbind(plsResamples, pcrResamples)
xyplot(RMSE ~ ncomp,
data = plsPlotData,
#aspect = 1,
xlab = "# Components",
ylab = "RMSE (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
type = c("o", "g"))
plsImp <- varImp(plsTune, scale = FALSE)
plotsize(7,5)
plot(plsImp, top = 25, scales = list(y = list(cex = .95)))
testResults$PCR <- predict(pcrTune, solTestXtrans)
plot(testResults)
cor(testResults)
Ridge regression finds the linear fit coefficients $\mathbf{b}$ that minimize $SSE + \lambda \sum_j b_j^2$. This results in a shrinking of the OLS estimate towards zero. Note that $\lambda$ is a penalty to be tuned: if $\lambda = 0$ then there is no penalty, resulting in the OLS estimate. If $\lambda$ is very large, then the fit will only include the intercept.
LASSO regression minimizes $SSE + \lambda \sum_j \left|b_j\right|$. This results in a sparse model as coefficients will often be zeroed out to reduce the fit's penalty.
An elastic net is a combination of, and often superior to, both of these penalized regression models. It tunes an additional parameter $\alpha \in [0,1]$ that mixes the LASSO and ridge penalties, emphasizing one at each endpoint.
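Ridge's closed form $\mathbf{b}_\lambda = (X'X + \lambda I)^{-1} X'y$ can be sketched directly on toy data (illustrative only; the caret/elasticnet code below does the real work):

```r
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- X %*% c(3, -2, 0, 0, 0) + rnorm(n)
# closed-form ridge solution for a given penalty lambda
ridge_coef <- function(X, y, lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
# coefficients shrink toward zero as the penalty grows
cbind(ols = ridge_coef(X, y, 0), ridge = ridge_coef(X, y, 100))
```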
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 6.4 Penalized Models
## The text used the elasticnet to obtain a ridge regression model.
## There is now a simple ridge regression method.
# ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 15))
# this is the original call, resulting in:
Ridge Regression
951 samples
228 predictors
Pre-processing: centered (228), scaled (228)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared
0.000000000 0.7207131 0.8769711
0.007142857 0.7047552 0.8818659
0.014285714 0.6964731 0.8847911
0.021428571 0.6925923 0.8862699
0.028571429 0.6908607 0.8870609
0.035714286 0.6904220 0.8874561
0.042857143 0.6908548 0.8875998
0.050000000 0.6919152 0.8875759
0.057142857 0.6934719 0.8874300
0.064285714 0.6954114 0.8872009
0.071428571 0.6976723 0.8869096
0.078571429 0.7002069 0.8865723
0.085714286 0.7029801 0.8862009
0.092857143 0.7059656 0.8858041
0.100000000 0.7091432 0.8853885
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.03571429.
ridgeGrid <- expand.grid(lambda = seq(.03, .05, length = 5))
set.seed(100)
ridgeTune <- train(x = solTrainXtrans, y = solTrainY,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("center", "scale"))
ridgeTune
Questions:
print(update(plot(ridgeTune), xlab = "Penalty"))
testResults$Ridge <- predict(ridgeTune, solTestXtrans)
enetGrid <- expand.grid(lambda = c(0, 0.01, .1),
fraction = seq(.05, 1, length = 20))
set.seed(100)
starttime <- proc.time()[3]
enetTune <- train(x = solTrainXtrans, y = solTrainY,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
proc.time()[3] - starttime
enetTune
plot(enetTune)
testResults$Enet <- predict(enetTune, solTestXtrans)
enetGrid2 <- expand.grid(lambda = c(.01, .02),
fraction = seq(.4, .6, length = 5))
enetTune2 <- train(x = solTrainXtrans, y = solTrainY,
method = "enet",
tuneGrid = enetGrid2,
trControl = ctrl,
preProc = c("center", "scale"))
enetTune2
plot(enetTune2)
testResults$Enet2 <- predict(enetTune2, solTestXtrans)
cor(testResults)
cat("Training Errors:\n")
fullpred <- predict(enetTune, solTrainXtrans)
cat('Enet: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(ridgeTune, solTrainXtrans)
cat('Ridge:',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(lmReducedVS, solTrainXtrans)
cat('Step: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),":\n")
fullpred <- predict(lmTune, solTrainXtrans)
cat('LM: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(rlmPCA, solTrainXtrans)
cat('RLM: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2)," : ")
fullpred <- predict(plsTune, solTrainXtrans)
cat('PLSR: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(pcrTune, solTrainXtrans)
cat('PCR: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),"\n")
cat("Testing Errors:\n")
fullpred <- predict(enetTune, solTestXtrans)
cat('Enet: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2)," : ")
fullpred <- predict(ridgeTune, solTestXtrans)
cat('Ridge:',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(lmReducedVS, solTestXtrans)
cat('Step: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),":\n")
fullpred <- predict(lmTune, solTestXtrans)
cat('LM: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(rlmPCA, solTestXtrans)
cat('RLM: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(plsTune, solTestXtrans)
cat('PLSR: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(pcrTune, solTestXtrans)
cat('PCR: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),"\n")
The "traditional" regression methods seem to achieve a training $R^2$ of 0.88-0.89 when predicting solubility (estimated with 10-fold cross-validation). Using cross-validation gives a more accurate assessment of how the method performs on future data as those predicted values didn't include their own data in estimating model coefficients. Contrast this to the training error from the full model:
Neural networks embed a hidden structure: the response is modeled as an additive combination of nonlinear transforms (hidden units) of linear combinations of the predictors. Each predictor and each hidden unit adds a large number of coefficients to estimate, so tuning is very important.
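A tiny single-hidden-layer example with `nnet` on a one-dimensional curve (a toy illustration of mine, not the solubility model):

```r
library(nnet)
set.seed(1)
x <- seq(-3, 3, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1)
# size = hidden units, decay = weight penalty, linout = regression output
fit <- nnet(x = matrix(x), y = y, size = 5, decay = 0.01,
            linout = TRUE, maxit = 500, trace = FALSE)
sqrt(mean((predict(fit, matrix(x)) - y)^2))  # in-sample RMSE
```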
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 7.1 Neural Networks
### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.
### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. We
### estimate the memory usage (VSIZE = total memory size) to be
### 2677M/core.
registerDoMC(4)
# # this took too long to run (not complete after 1 hour)
# nnetGrid <- expand.grid(decay = c(0, 0.01, .1),
# size = c(1, 3, 5, 7, 9, 11, 13),
# bag = FALSE)
# advice when tuning: try an initial grid that is very broad and sparse,
# like decay = c(0, .5, 10) and size = c(3, 9, 17)
# and then fine-tune based on results
nnetGrid <- expand.grid(size = c(1, 4, 7, 10), decay = c(0.01, 0.1, 1, 10))
# bag = FALSE)
nnetGrid
set.seed(100)
indx2 <- createFolds(solTrainY, returnTrain = TRUE, k=2) # for speed purposes, 2-fold CV here
str(indx2)
ctrl2 <- trainControl(method = "cv", index = indx2)
ctrl2$number = length(ctrl2$index) # for some reason, this isn't set correctly in the function trainControl
which(colnames(trainingData) == "Solubility")
dim(trainingData); dim(testingData)
set.seed(100)
starttime <- proc.time()[3]
nnetTune <- train(x = trainingData[,-126], y = trainingData[,126],
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl2,
preProc = c("center", "scale"),
linout = TRUE,
trace = TRUE,
#MaxNWts = 13 * (ncol(solTrainXtrans) + 1) + 13 + 1,
MaxNWts = 7 * (ncol(testingData) + 1) + 7 + 1,
maxit = 250,
allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune
plot(nnetTune)
nnetGrid2 <- rbind(
expand.grid(size = c(1, 3), decay = c(0.05, 0.2, 0.8)),
expand.grid(size = c(6, 8), decay = c(0.5, 2, 8)))
nnetGrid2
set.seed(100)
starttime <- proc.time()[3]
nnetTune2 <- train(x = trainingData[,-126], y = trainingData[,126],
method = "nnet",
tuneGrid = nnetGrid2,
trControl = ctrl2,
preProc = c("center", "scale"),
linout = TRUE,
trace = TRUE,
MaxNWts = 8 * (ncol(testingData) + 1) + 8 + 1, # must cover the largest size in nnetGrid2 (8)
maxit = 250,
allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune2
plot(nnetTune2)
nnetGrid3 <- rbind(
expand.grid(size = c(3, 5), decay = c(0.5, 1.5, 5)),
expand.grid(size = c(5, 7), decay = c(3, 9, 27)))
nnetGrid3
set.seed(100)
starttime <- proc.time()[3]
nnetTune3 <- train(x = trainingData[,-126], y = trainingData[,126],
method = "nnet",
tuneGrid = nnetGrid3,
trControl = ctrl2,
preProc = c("center", "scale"),
linout = TRUE,
trace = TRUE,
MaxNWts = 7 * (ncol(testingData) + 1) + 7 + 1,
maxit = 1000,
allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune3
plot(nnetTune3)
We're getting close; let's concentrate on 5 and 7 nodes and the decay range 1 to 8.
nnetGrid4 <- expand.grid(size = c(5, 7), decay = c(1, 2, 4, 8))
t(nnetGrid4)
set.seed(100)
starttime <- proc.time()[3]
nnetTune4 <- train(x = trainingData[,-126], y = trainingData[,126],
method = "nnet",
tuneGrid = nnetGrid4,
trControl = ctrl2,
preProc = c("center", "scale"),
linout = TRUE,
trace = TRUE,
MaxNWts = 11 * (ncol(testingData) + 1) + 11 + 1,
maxit = 1500,
allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune4
plot(nnetTune4)
nnetGrid5 <- expand.grid(size = 7, decay = c(5, 8))
t(nnetGrid5)
set.seed(100)
starttime <- proc.time()[3]
nnetTune5 <- train(x = trainingData[,-126], y = trainingData[,126],
method = "nnet",
tuneGrid = nnetGrid5,
trControl = ctrl, # back to 4-fold CV
preProc = c("center", "scale"),
linout = TRUE,
trace = TRUE,
MaxNWts = 13 * (ncol(testingData) + 1) + 13 + 1,
maxit = 2000,
allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune5
plot(nnetTune5)
testResults$NNet = predict(nnetTune5, testingData)
cor(testResults[,c(1,8:11)])
To use another implementation, avNNet (model averaging over several networks), highly correlated predictors should first be removed.
# (C) Kuhn and Johnson, 2013
tooHigh <- findCorrelation(cor(solTrainXtrans), cutoff=.75)
trainXnnet <- solTrainXtrans[, -tooHigh]
testXnnet <- solTestXtrans[, -tooHigh]
startTime <- proc.time()[3]
# nnetGrid <- expand.grid(decay = c(0, 0.01, .1),
# size = c(1, 3, 5, 7, 9, 11, 13),
# bag = TRUE)
broadGrid <- expand.grid(decay = c(1, 10),
size = 7,
bag = TRUE)
broadGrid
# set.seed(100)
# when I mistakenly ran on solTrainXtrans, only [1.5, 5] completed without NA
# so I assume using trainXnnet for avNNet is required
avNNetTune <- train(trainXnnet, solTrainY, method="avNNet",
tuneGrid = broadGrid, preProc = c("center", "scale"),
linout = TRUE, trace = TRUE,
MaxNWts = 9*(ncol(trainXnnet) + 1) + 9 + 1,
maxit = 50)
cat(sprintf('Total time %.1f', proc.time()[3]-startTime))
avNNetTune
( ! ) We can see that avNNet takes a lot of time. I'll defer to the experts on implementing this method...
# Bagging doesn't take more time but seems to give poorer results...
startTime <- proc.time()[3]
# nnetGrid <- expand.grid(decay = c(0, 0.01, .1),
# size = c(1, 3, 5, 7, 9, 11, 13),
# bag = TRUE)
broadGrid <- expand.grid(decay = c(0.1, 1, 10),
size = c(7, 3, 11),
bag = FALSE)
broadGrid = broadGrid[c(1:3,5,8),]
broadGrid
# set.seed(100)
# when I mistakenly ran on solTrainXtrans, only [1.5, 5] completed without NA
# so I assume using trainXnnet for avNNet is required (remove correlated var.)
avNNetTune2 <- train(trainXnnet, solTrainY, method="avNNet",
tuneGrid = broadGrid, preProc = c("center", "scale"),
linout = TRUE, trace = TRUE,
MaxNWts = 10*(ncol(trainXnnet) + 1) + 10 + 1,
maxit = 100)
cat(sprintf('Total time %.1f', proc.time()[3]-startTime))
avNNetTune2
# checking to see changes in results with 1000 iterations for 3 cases in avNNetTune
startTime <- proc.time()[3]
# nnetGrid <- expand.grid(decay = c(0, 0.01, .1),
# size = c(1, 3, 5, 7, 9, 11, 13),
# bag = TRUE)
broadGrid <- expand.grid(decay = c(0.1, 1, 10),
size = c(7, 3, 11),
bag = TRUE)
broadGrid = broadGrid[c(1,3,5),]
broadGrid
# set.seed(100)
# when I mistakenly ran on solTrainXtrans, only [1.5, 5] completed without NA
# so I assume using trainXnnet for avNNet is required
avNNetTune3 <- train(trainXnnet, solTrainY, method="avNNet",
tuneGrid = broadGrid, preProc = c("center", "scale"),
linout = TRUE, trace = TRUE,
MaxNWts = 10*(ncol(trainXnnet) + 1) + 10 + 1,
maxit = 1000)
cat(sprintf('Total time %.1f', proc.time()[3]-startTime))
avNNetTune3
For comparison, the maxit = 100 results (from avNNetTune2) were:

 decay  size       RMSE   Rsquared
   0.1     7  0.9321281  0.8002468
   1.0     3  0.8950329  0.8110429
  10.0     7  0.8151765  0.8436004
# may not be optimal, but should give good results
broadGrid <- expand.grid(decay = c(0.5),
size = c(6, 8),
bag = TRUE)
set.seed(100)
startTime <- proc.time()[3]
avNNetTuneF <- train(trainXnnet, solTrainY, method="avNNet",
tuneGrid = broadGrid, preProc = c("center", "scale"),
linout = TRUE, trace = TRUE,
MaxNWts = 10*(ncol(trainXnnet) + 1) + 10 + 1,
maxit = 1500)
cat(sprintf('Total time %.1f', proc.time()[3]-startTime))
avNNetTuneF
#testResults$avNNet = predict(avNNetTuneF, testXnnet)
plotsize(7,4)
plot(avNNetTuneF)
Splines are a flexible way to fit a nonlinear regression function ($E(Y|X)$). MARS uses a collection of hockey-stick functions (hinge functions) to fit the response surface: $$h(x) = x\, \mathbf{I}[x > 0].$$ The individual basis functions are actually $h(x_j - c)$ and $h(c - x_j)$, where the knots $c$ (the points where the hinge functions start increasing) must be estimated along with the coefficients that combine the hinge functions into the fitted regression function.
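The hinge pair can be written directly in R; a quick sketch with an arbitrary knot at c = 2:

```r
h <- function(x) x * (x > 0)                 # hinge function h(x) = x * I[x > 0]
x <- seq(-1, 5, by = 1)
cbind(x, right = h(x - 2), left = h(2 - x))  # reflected pair about the knot c = 2
```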
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 7.2 Multivariate Adaptive Regression Splines
set.seed(100)
# first a wide sparse tuneGrid
marsTune0 <- train(x = solTrainXtrans, y = solTrainY,
method = "earth",
tuneGrid = expand.grid(degree = c(1,2,3), nprune = c(3,9,27,81)),
trControl = ctrl)
marsTune0
plot(marsTune0)
set.seed(100)
# expanding tuneGrid
marsTune1 <- train(x = solTrainXtrans, y = solTrainY,
method = "earth",
tuneGrid = expand.grid(degree = c(1,2,3), nprune = c(60,120)),
#tuneGrid = expand.grid(degree = c(1,2,3), nprune = round(30*c(1,1.25,1.25^2,1.25^3))),
trControl = ctrl)
marsTune1
plot(marsTune1)
set.seed(100)
# finer tuneGrid
tuneval <- expand.grid(nprune = 10*(3:9), degree = c(2,1,3))
tuneval = tuneval[c(1:8,10,12,16,19),]
t(tuneval)
marsTune <- train(x = solTrainXtrans, y = solTrainY,
method = "earth",
tuneGrid = tuneval,
trControl = ctrl20)
marsTune
# takes two minutes (but worth it)
plot(marsTune)
summary(marsTune)
testResults$MARS <- predict(marsTune, solTestXtrans)
cor(testResults[,c(1,8:12)])
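Note that the ctrl20 object used above isn't defined in this chunk; presumably it was built the same way as ctrl2 above and the 50-fold control later, i.e. something like:

```r
set.seed(100)
indx20 <- createFolds(solTrainY, returnTrain = TRUE, k = 20)
ctrl20 <- trainControl(method = "cv", index = indx20)
```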
The improvement at degree=3 seems to come out of nowhere. This is one of the problems with using a small number of folds: each fold's model is trained on less data. The additional training data help in finding the degree=3 effects, which appear to be real given the strong test set results. To compare with 4-fold results:
marsTune4Fold <- train(x = solTrainXtrans, y = solTrainY,
method = "earth",
tuneGrid = tuneval,
trControl = ctrl)
marsTune4Fold
# takes one minute
plot(marsTune4Fold)
An added bonus, predictor importance measures are available from the MARS fits:
marsImp <- varImp(marsTune, scale = FALSE)
plotsize(7,5)
plot(marsImp, top = 25)
Just like neural networks, and maybe more so, it is very important to tune SVMs over their model parameters. Below, the cost parameter is swept over several orders of magnitude -- a good idea, since useful SVM parameter values span a much wider range than those of most methods.
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 7.3 Support Vector Machines
## In a recent update to caret, the method to estimate the
## sigma parameter was slightly changed. These results will
## slightly differ from the text for that reason.
set.seed(100)
svmRTune <- train(x = solTrainXtrans, y = solTrainY,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = ctrl)
svmRTune
svmRTune$finalModel
plot(svmRTune, scales = list(x = list(log = 2)))
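When tuneLength is given, caret fixes the kernel width sigma using kernlab's sigest() and sweeps only the cost. You can inspect that estimate directly (a sketch; sigest returns low/median/high values of a plausible sigma range):

```r
library(kernlab)
# caret's svmRadial derives its single sigma from these quantile estimates
sigest(as.matrix(solTrainXtrans), scaled = TRUE)
```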
# but can sigma be tuned in the radial basis kernel?
Rtuneval <- expand.grid(sigma = c(.0027, .0009, .0081), C = 5^(-1:4))
set.seed(100)
svmRTune1 <- train(x = solTrainXtrans, y = solTrainY,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = Rtuneval,
trControl = ctrl)
svmRTune1
plot(svmRTune1, scales = list(x = list(log = 2)))
Choice of sigma seems adequate; maybe I'd rerun with sigma = .002.
...But it's always good to know where the overfitting regime is for each parameter, and reducing sigma exposes us to overfitting the particulars of our training data, just as choosing $k=1$ in nearest neighbors does. That said, sigma = .0027 might be just right.
svmGrid <- expand.grid(degree = 1:2,
scale = c(0.025, 0.005, 0.001),
C = 10^(-1:2))
set.seed(100)
svmPTune <- train(x = solTrainXtrans, y = solTrainY,
method = "svmPoly",
preProc = c("center", "scale"),
tuneGrid = svmGrid,
trControl = ctrl)
svmPTune
plot(svmPTune,
scales = list(x = list(log = 2),
between = list(x = .5, y = 1)))
Which values for degree, scale, and cost did the program select? Are these the ones you would select?
svmGrid <- expand.grid( scale = c(0.02, 0.01),
C = 4^(-1:1), degree = c(2,3))
svmGrid = svmGrid[c(1:6,9,12),]
# for degree 3: choosing C=4 with scale=0.01 (not too low) and C=1 with scale=0.02
svmGrid
set.seed(100)
svmPTune <- train(x = solTrainXtrans, y = solTrainY,
method = "svmPoly",
preProc = c("center", "scale"),
tuneGrid = svmGrid,
trControl = ctrl)
svmPTune
plot(svmPTune,
scales = list(x = list(log = 2),
between = list(x = .5, y = 1)))
Based on these CV results, I choose scale=0.01 (better than 0.005), Cost=2, and degree=2 (not worth the jump in complexity to degree 3 for almost no benefit).
svmPreProc <- preProcess(solTrainXtrans, c("center","scale"))
solTrainXsvm <- predict(svmPreProc, solTrainXtrans)
solTestXsvm <- predict(svmPreProc, solTestXtrans)
set.seed(100)
svmPTuneF <- ksvm(x = as.matrix(solTrainXsvm), y = solTrainY,
kernel = "polydot", kpar = list(degree = 2, scale = 0.01, offset = 1),
C = 2, scaled = FALSE) # kernlab's ksvm (what caret's svmPoly wraps); data already centered/scaled above
svmPTuneF
testResults$SVMr <- predict(svmRTune, solTestXtrans)
testResults$SVMp <- predict(svmPTuneF, as.matrix(solTestXsvm))
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 7.4 K-Nearest Neighbors
### First we remove near-zero variance predictors
knnDescr <- solTrainXtrans[, -nearZeroVar(solTrainXtrans)]
set.seed(100)
knnTune <- train(x = knnDescr, y = solTrainY,
method = "knn",
preProc = c("center", "scale"),
tuneGrid = data.frame(k = 1:20),
trControl = ctrl)
knnTune
testResults$Knn <- predict(knnTune, solTestXtrans[, names(knnDescr)])
plot(knnTune)
# I'd use k=8 based on those CV results
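For intuition, the FNN package (loaded at the top of this notebook) exposes the same k-nearest-neighbor averaging directly; a sketch of a k = 8 fit on the centered and scaled predictors:

```r
library(FNN)
pp <- preProcess(knnDescr, c("center", "scale"))
knnPred <- knn.reg(train = predict(pp, knnDescr),
                   test  = predict(pp, solTestXtrans[, names(knnDescr)]),
                   y = solTrainY, k = 8)
head(knnPred$pred)  # each prediction is the mean response of the 8 nearest training points
```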
set.seed(100)
indx50 <- createFolds(solTrainY, returnTrain = TRUE, k=50)
ctrl50 <- trainControl(method = "cv", index = indx50)
set.seed(100)
knnTuneF <- train(x = knnDescr, y = solTrainY,
method = "knn",
preProc = c("center", "scale"),
tuneGrid = data.frame(k = 6:10),
trControl = ctrl50)
knnTuneF
dim(knnDescr)
head(knnDescr)
nearZeroVar(solTrainXtrans)
dim(solTestXtrans)
testResults$Knn <- predict(knnTuneF, solTestXtrans[, names(knnDescr)])
cat("Training Errors:\n")
fullpred <- predict(enetTune, solTrainXtrans)
cat('Enet: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(nnetTune, solTrainXtrans)
cat('NNet: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(avNNetTuneF, trainXnnet)
cat('avNNet:',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),":\n")
fullpred <- predict(marsTune, solTrainXtrans)
cat('MARS: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(svmRTune, solTrainXtrans)
cat('SVMr: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2)," : ")
fullpred <- predict(svmPTune, solTrainXtrans)
cat('SVMp: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(knnTuneF, solTrainXtrans)
cat('kNN: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),"\n")
cat("Testing Errors:\n")
fullpred <- predict(enetTune, solTestXtrans)
cat('Enet: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2)," : ")
fullpred <- predict(nnetTune, solTestXtrans)
cat('NNet: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(avNNetTuneF, testXnnet)
cat('avNNet:',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),":\n")
fullpred <- predict(marsTune, solTestXtrans)
cat('MARS: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(svmRTune, solTestXtrans)
cat('SVMr: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(svmPTune, solTestXtrans)
cat('SVMp: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- testResults$Knn
cat('kNN: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),"\n")
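The repeated $R^2$ computations above can be collapsed into a small helper; a refactoring sketch (equivalent to the expressions above):

```r
# R^2 as 1 - SS_res / SS_tot
r2 <- function(pred, obs) 1 - sum((pred - obs)^2) / sum((obs - mean(obs))^2)
r2(predict(marsTune, solTestXtrans), solTestY)  # matches the MARS testing value above
```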
# I will ignore the additional tuning for SVMp (didn't work)
testResults$SVMp <- predict(svmPTune, solTestXtrans)
temp <- sapply(2:ncol(testResults),
function(k){
sqrt(sum((testResults$obs - testResults[,k])^2)/nrow(testResults))})
names(temp) <- colnames(testResults)[2:ncol(testResults)]
temp
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
###
### Chapter 8: Regression Trees and Rule-Based Models
################################################################################
### Section 8.1 Basic Regression Trees
### Fit two CART models to show the initial splitting process. rpart
### only uses formulas, so we put the predictors and outcome into
### a common data frame first.
trainData <- solTrainXtrans
trainData$y <- solTrainY
rpStump <- rpart(y ~ ., data = trainData,
control = rpart.control(maxdepth = 1))
rpSmall <- rpart(y ~ ., data = trainData,
control = rpart.control(maxdepth = 2))
### Tune the model
set.seed(100)
cartTune <- train(x = solTrainXtrans, y = solTrainY,
method = "rpart",
tuneLength = 25,
trControl = ctrl)
# Note: to tune over maximum tree depth, use method = "rpart2" (rec. tuneLength=10)
cartTune
cartTune$finalModel
### Plot the tuning results
plot(cartTune, scales = list(x = list(log = 10)))
# (C) Kuhn and Johnson, 2013
### Use the partykit package to make some nice plots. First, convert
### the rpart objects to party objects.
cartTree <- as.party(cartTune$finalModel)
plot(cartTree)
# (C) Kuhn and Johnson, 2013
### Get the variable importance. 'competes' is an argument that
### controls whether splits not used in the tree should be included
### in the importance calculations.
cartImp <- varImp(cartTune, scale = FALSE, competes = FALSE)
cartImp
### Save the test set results in a data frame
testResults$CART = predict(cartTune, solTestXtrans)
### Tune the conditional inference tree
cGrid <- data.frame(mincriterion = sort(c(.95, seq(.5, .99, length = 3))))
set.seed(100)
ctreeTune <- train(x = solTrainXtrans, y = solTrainY,
method = "ctree",
tuneGrid = cGrid,
trControl = ctrl)
ctreeTune
plot(ctreeTune)
ctreeTune$finalModel
plot(ctreeTune$finalModel)
testResults$cTree <- predict(ctreeTune, solTestXtrans)
Note: I cannot install RWeka on my machine, so I'll skip this section and leave the code for your exploration.
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.2 Regression Model Trees and 8.3 Rule-Based Models
### Tune the model tree. Using method = "M5" actually tunes over the
### tree- and rule-based versions of the model. M = 10 is also passed
### in to make sure that there are larger terminal nodes for the
### regression models.
set.seed(100)
m5Tune <- train(x = solTrainXtrans, y = solTrainY,
method = "M5",
trControl = ctrl,
control = Weka_control(M = 10))
m5Tune
plot(m5Tune)
m5Tune$finalModel
testResults$M5 = predict(m5Tune, solTestXtrans)
plot(m5Tune$finalModel)
### Show the rule-based model too
ruleFit <- M5Rules(y~., data = trainData, control = Weka_control(M = 10))
ruleFit
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.4 Bagged Trees
### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.
### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. The
### estimate of the median memory usage (VSIZE = total memory size)
### was roughly 9706M per core. This becomes
### severe when parallelizing randomForest() and (especially) calls
### to cforest().
### WARNING 2: The RWeka package does not work well with some forms of
### parallel processing, such as multicore (i.e. doMC).
registerDoMC(4)
set.seed(100)
treebagTune <- train(x = solTrainXtrans, y = solTrainY,
method = "treebag", nbagg = 25, # 'nbagg' is correct
trControl = ctrl)
treebagTune
testResults$Bag = predict(treebagTune, solTestXtrans)
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.5 Random Forests
mtryGrid <- data.frame(mtry = floor(c(.65,1,1.5)* ncol(solTrainXtrans)/3))
mtryGrid
There is no reason to use importance = TRUE during CV; computing importances just slows down each fit.
### Tune the model using cross-validation
set.seed(100)
rfTune <- train(x = solTrainXtrans, y = solTrainY,
method = "rf",
tuneGrid = mtryGrid,
ntree = 200,
importance = FALSE,
trControl = ctrl)
rfTune
plot(rfTune)
mtryGrid <- data.frame(mtry = floor(c(1.33,1.67)* ncol(solTrainXtrans)/3))
t(mtryGrid)
### Tune the model using cross-validation
set.seed(100)
rfTune <- train(x = solTrainXtrans, y = solTrainY,
method = "rf",
tuneGrid = mtryGrid,
ntree = 500,
importance = FALSE,
trControl = ctrl)
rfTune
### Tune the model using cross-validation
set.seed(100)
rfTuneImp <- train(x = solTrainXtrans, y = solTrainY,
method = "rf",
tuneGrid = data.frame(mtry=101),
ntree = 500,
importance = TRUE,
trControl = ctrl)
rfTuneImp
rfImp <- varImp(rfTuneImp, scale = FALSE)
rfImp
### Tune the model using the OOB estimates
ctrlOOB <- trainControl(method = "oob")
set.seed(100)
rfTuneOOB <- train(x = solTrainXtrans, y = solTrainY,
method = "rf",
tuneGrid = mtryGrid,
ntree = 500,
importance = FALSE,
trControl = ctrlOOB)
rfTuneOOB
### Tune the conditional inference forests
set.seed(100)
condrfTune <- train(x = solTrainXtrans, y = solTrainY,
method = "cforest",
tuneGrid = data.frame(mtry=120),
controls = cforest_unbiased(ntree = 500),
trControl = ctrl)
condrfTune
plot(condrfTune$results)
set.seed(100)
condrfTuneOOB <- train(x = solTrainXtrans, y = solTrainY,
method = "cforest",
tuneGrid = data.frame(mtry=c(80,100,120,140)),
controls = cforest_unbiased(ntree = 250),
trControl = trainControl(method = "oob"))
condrfTuneOOB
plotsize(6,5)
plot(condrfTuneOOB$results[,c(3,1)], type='l')
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.6 Boosting
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 3),
n.trees = seq(100, 700, by = 300),
shrinkage = c(0.002, 0.02, 0.2),
n.minobsinnode = 2)
# The gbmGrid has 27 combinations, and we should get a sense about
# the direction to take with a finer grid
set.seed(100)
gbmTune <- train(x = solTrainXtrans, y = solTrainY,
method = "gbm",
tuneGrid = gbmGrid,
trControl = ctrl4,
verbose = FALSE)
gbmTune
Note that tuning these methods is getting complicated; I'm tempted to start using design-of-experiments techniques such as a Latin square. We can do this by eliminating many of the rows from gbmGrid.
plot(gbmTune, auto.key = list(columns = 4, lines = TRUE))
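One cheap design-of-experiments move is a Latin-square-style subset of the factorial grid: keep 9 of the 27 rows so that every pair of parameter levels still appears together. A sketch (the selection rule is one arbitrary Latin square):

```r
lev <- expand.grid(d = 0:2, t = 0:2, s = 0:2)   # level indices in gbmGrid's row order
keep <- with(lev, (d + t + s) %% 3 == 0)        # Latin-square condition
smallGrid <- gbmGrid[keep, ]                    # 9 rows; each (depth, shrinkage) pair
smallGrid                                       # gets exactly one n.trees value
```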
gbmGrid <- expand.grid(interaction.depth = seq(3, 9, by = 3),
n.trees = seq(300, 1000, by = 350),
shrinkage = c(0.004, 0.008, 0.016),
n.minobsinnode = 2)
# A refined 27-combination grid, narrowed using the first sweep's results
set.seed(100)
gbmTune <- train(x = solTrainXtrans, y = solTrainY,
method = "gbm",
tuneGrid = gbmGrid,
trControl = ctrl4,
verbose = FALSE)
gbmTune
plot(gbmTune, auto.key = list(columns = 4, lines = TRUE))
gbmImp <- varImp(gbmTune, scale = FALSE)
gbmImp
#save(list=ls(), file="data0623.Rdata")
load("data0623.Rdata")
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.7 Cubist
cbGrid <- expand.grid(committees = c(1:10, 20, 50, 75, 100),
neighbors = c(0, 1, 5, 9))
set.seed(100)
cubistTune <- train(solTrainXtrans, solTrainY,
"cubist",
tuneGrid = cbGrid,
trControl = ctrl)
cubistTune
plot(cubistTune, auto.key = list(columns = 4, lines = TRUE))
cbImp <- varImp(cubistTune, scale = FALSE)
cbImp
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
### Web Page: http://www.appliedpredictivemodeling.com
### Contact: Max Kuhn (mxkuhn@gmail.com)
###
### Chapter 10: Case Study: Compressive Strength of Concrete Mixtures
###
### Required packages: AppliedPredictiveModeling, caret, Cubist, doMC (optional),
### earth, elasticnet, gbm, ipred, lattice, nnet, party, pls,
### randomForest, rpart, RWeka
###
### Data used: The concrete from the AppliedPredictiveModeling package
data(concrete)
featurePlot(concrete[, -9], concrete$CompressiveStrength,
between = list(x = 1, y = 1),
type = c("g", "p", "smooth"))
### Copyright 2013 Kuhn and Johnson
################################################################################
### Section 10.1 Model Building Strategy
### There are replicated mixtures, so take the average per mixture
averaged <- ddply(mixtures,
.(Cement, BlastFurnaceSlag, FlyAsh, Water,
Superplasticizer, CoarseAggregate,
FineAggregate, Age),
function(x) c(CompressiveStrength =
mean(x$CompressiveStrength)))
### Split the data and create a control object for train()
set.seed(975)
inTrain <- createDataPartition(averaged$CompressiveStrength, p = 3/4)[[1]]
training <- averaged[ inTrain,]
testing <- averaged[-inTrain,]
ctrl <- trainControl(method = "repeatedcv", repeats = 5, number = 10)
### Create a model formula that can be used repeatedly
modForm <- paste("CompressiveStrength ~ (.)^2 + I(Cement^2) + I(BlastFurnaceSlag^2) +",
"I(FlyAsh^2) + I(Water^2) + I(Superplasticizer^2) +",
"I(CoarseAggregate^2) + I(FineAggregate^2) + I(Age^2)")
modForm <- as.formula(modForm)
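That formula expands to all main effects, all two-way interactions, and the quadratic terms. A quick check of the expansion (a sketch, assuming training is in memory):

```r
# 8 main effects + choose(8, 2) = 28 interactions + 8 quadratics + intercept = 45
X <- model.matrix(modForm, data = training)
ncol(X)
```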
### Fit the various models
### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.
### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. The
### estimate of the median memory usage (VSIZE = total memory size)
### was 2800M for a core although the M5 calculations require about
### 3700M without parallel processing.
### WARNING 2: The RWeka package does not work well with some forms of
### parallel processing, such as multicore (i.e. doMC).
registerDoMC(14)
set.seed(669)
lmFit <- train(modForm, data = training,
method = "lm",
trControl = ctrl)
lmFit
set.seed(669)
plsFit <- train(modForm, data = training,
method = "pls",
preProc = c("center", "scale"),
tuneLength = 15,
trControl = ctrl)
plsFit
lassoGrid <- expand.grid(lambda = c(0, .001, .01, .1),
fraction = seq(0.05, 1, length = 20))
set.seed(669)
lassoFit <- train(modForm, data = training,
method = "enet",
preProc = c("center", "scale"),
tuneGrid = lassoGrid,
trControl = ctrl)
lassoFit
set.seed(669)
earthFit <- train(CompressiveStrength ~ ., data = training,
method = "earth",
tuneGrid = expand.grid(degree = 1,
nprune = 2:25),
trControl = ctrl)
earthFit
set.seed(669)
svmRFit <- train(CompressiveStrength ~ ., data = training,
method = "svmRadial",
tuneLength = 15,
preProc = c("center", "scale"),
trControl = ctrl)
svmRFit
nnetGrid <- expand.grid(decay = c(0.001, .01, .1),
size = seq(1, 27, by = 2),
bag = FALSE)
set.seed(669)
nnetFit <- train(CompressiveStrength ~ .,
data = training,
method = "avNNet",
tuneGrid = nnetGrid,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
maxit = 1000,
allowParallel = FALSE,
trControl = ctrl)
nnetFit
set.seed(669)
rpartFit <- train(CompressiveStrength ~ .,
data = training,
method = "rpart",
tuneLength = 30,
trControl = ctrl)
rpartFit
set.seed(669)
treebagFit <- train(CompressiveStrength ~ .,
data = training,
method = "treebag",
trControl = ctrl)
treebagFit
set.seed(669)
ctreeFit <- train(CompressiveStrength ~ .,
data = training,
method = "ctree",
tuneLength = 10,
trControl = ctrl)
ctreeFit
set.seed(669)
rfFit <- train(CompressiveStrength ~ .,
data = training,
method = "rf",
tuneLength = 10,
ntree = 1000, # randomForest's argument is 'ntree'
importance = TRUE,
trControl = ctrl)
rfFit
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1))
set.seed(669)
gbmFit <- train(CompressiveStrength ~ .,
data = training,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
gbmFit
cbGrid <- expand.grid(committees = c(1, 5, 10, 50, 75, 100),
neighbors = c(0, 1, 3, 5, 7, 9))
set.seed(669)
cbFit <- train(CompressiveStrength ~ .,
data = training,
method = "cubist",
tuneGrid = cbGrid,
trControl = ctrl)
cbFit
### Turn off the parallel processing to use RWeka.
registerDoSEQ()
set.seed(669)
mtFit <- train(CompressiveStrength ~ .,
data = training,
method = "M5",
trControl = ctrl)
mtFit
################################################################################
### Section 10.2 Model Performance
### Collect the resampling statistics across all the models
rs <- resamples(list("Linear Reg" = lmFit,
"PLS" = plsFit,
"Elastic Net" = lassoFit,
MARS = earthFit,
SVM = svmRFit,
"Neural Networks" = nnetFit,
CART = rpartFit,
"Cond Inf Tree" = ctreeFit,
"Bagged Tree" = treebagFit,
"Boosted Tree" = gbmFit,
"Random Forest" = rfFit,
Cubist = cbFit))
#parallelPlot(rs)
#parallelPlot(rs, metric = "Rsquared")
### Get the test set results across several models
nnetPred <- predict(nnetFit, testing)
gbmPred <- predict(gbmFit, testing)
cbPred <- predict(cbFit, testing)
testResults <- rbind(postResample(nnetPred, testing$CompressiveStrength),
postResample(gbmPred, testing$CompressiveStrength),
postResample(cbPred, testing$CompressiveStrength))
testResults <- as.data.frame(testResults)
testResults$Model <- c("Neural Networks", "Boosted Tree", "Cubist")
testResults <- testResults[order(testResults$RMSE),]
testResults
################################################################################
### Section 10.3 Optimizing Compressive Strength
### Create a function to maximize compressive strength* while keeping
### the predictor values as mixtures. Water (in x[7]) is used as the
### 'slack variable'.
### * We are actually minimizing the negative compressive strength
modelPrediction <- function(x, mod, limit = 2500)
{
if(x[1] < 0 | x[1] > 1) return(10^38)
if(x[2] < 0 | x[2] > 1) return(10^38)
if(x[3] < 0 | x[3] > 1) return(10^38)
if(x[4] < 0 | x[4] > 1) return(10^38)
if(x[5] < 0 | x[5] > 1) return(10^38)
if(x[6] < 0 | x[6] > 1) return(10^38)
x <- c(x, 1 - sum(x))
if(x[7] < 0.05) return(10^38)
tmp <- as.data.frame(t(x))
names(tmp) <- c('Cement','BlastFurnaceSlag','FlyAsh',
'Superplasticizer','CoarseAggregate',
'FineAggregate', 'Water')
tmp$Age <- 28
-predict(mod, tmp)
}
### Get mixtures at 28 days
subTrain <- subset(training, Age == 28)
### Center and scale the data to use dissimilarity sampling
pp1 <- preProcess(subTrain[, -(8:9)], c("center", "scale"))
scaledTrain <- predict(pp1, subTrain[, 1:7])
### Randomly select a few mixtures as a starting pool
set.seed(91)
startMixture <- sample(1:nrow(subTrain), 1)
starters <- scaledTrain[startMixture, 1:7]
pool <- scaledTrain
index <- maxDissim(starters, pool, 14)
startPoints <- c(startMixture, index)
starters <- subTrain[startPoints,1:7]
startingValues <- starters[, -4]
### For each starting mixture, optimize the Cubist model using
### a simplex search routine
cbResults <- startingValues
cbResults$Water <- NA
cbResults$Prediction <- NA
for(i in 1:nrow(cbResults))
{
results <- optim(unlist(cbResults[i,1:6]),
modelPrediction,
method = "Nelder-Mead",
control=list(maxit=5000),
mod = cbFit)
cbResults$Prediction[i] <- -results$value
cbResults[i,1:6] <- results$par
}
cbResults$Water <- 1 - apply(cbResults[,1:6], 1, sum)
cbResults <- subset(cbResults, Prediction > 0 & Water > .02)
cbResults <- cbResults[order(-cbResults$Prediction),][1:3,]
cbResults$Model <- "Cubist"
### Do the same for the neural network model
nnetResults <- startingValues
nnetResults$Water <- NA
nnetResults$Prediction <- NA
for(i in 1:nrow(nnetResults))
{
results <- optim(unlist(nnetResults[i, 1:6]),
modelPrediction,
method = "Nelder-Mead",
control=list(maxit=5000),
mod = nnetFit)
nnetResults$Prediction[i] <- -results$value
nnetResults[i,1:6] <- results$par
}
nnetResults$Water <- 1 - apply(nnetResults[,1:6], 1, sum)
nnetResults <- subset(nnetResults, Prediction > 0 & Water > .02)
nnetResults <- nnetResults[order(-nnetResults$Prediction),][1:3,]
nnetResults$Model <- "NNet"
### Convert the predicted mixtures to PCA space and plot
pp2 <- preProcess(subTrain[, 1:7], "pca")
pca1 <- predict(pp2, subTrain[, 1:7])
pca1$Data <- "Training Set"
pca1$Data[startPoints] <- "Starting Values"
pca3 <- predict(pp2, cbResults[, names(subTrain[, 1:7])])
pca3$Data <- "Cubist"
pca4 <- predict(pp2, nnetResults[, names(subTrain[, 1:7])])
pca4$Data <- "Neural Network"
pcaData <- rbind(pca1, pca3, pca4)
pcaData$Data <- factor(pcaData$Data,
levels = c("Training Set","Starting Values",
"Cubist","Neural Network"))
lim <- extendrange(pcaData[, 1:2])
xyplot(PC2 ~ PC1,
data = pcaData,
groups = Data,
auto.key = list(columns = 2),
xlim = lim,
ylim = lim,
type = c("g", "p"))